138 research outputs found
Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval
In this paper we tackle the cross-modal video retrieval problem and, more
specifically, we focus on text-to-video retrieval. We investigate how to
optimally combine multiple diverse textual and visual features into feature
pairs that lead to generating multiple joint feature spaces, which encode
text-video pairs into comparable representations. To learn these
representations our proposed network architecture is trained by following a
multiple space learning procedure. Moreover, at the retrieval stage, we
introduce additional softmax operations for revising the inferred query-video
similarities. Extensive experiments in several setups based on three
large-scale datasets (IACC.3, V3C1, and MSR-VTT) lead to conclusions on how to
best combine text-visual features and document the performance of the proposed
network. Source code is made publicly available at:
https://github.com/bmezaris/TextToVideoRetrieval-TtimesV
Comment: Accepted for publication; to be included in Proc. ECCV Workshops 2022. The version posted here is the "submitted manuscript" version.
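The core idea lends itself to a short sketch. Below is a minimal PyTorch sketch of multiple space learning as described above: each textual-visual feature pair gets its own pair of linear projections into a joint space, the per-space cosine similarities are summed, and a softmax over the query-video similarity matrix revises the scores at retrieval time. All module names, dimensions and the temperature value are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSpaceRetrieval(nn.Module):
    def __init__(self, text_dims, video_dims, joint_dim=512):
        super().__init__()
        assert len(text_dims) == len(video_dims)
        # One pair of linear projections per joint feature space.
        self.text_proj = nn.ModuleList([nn.Linear(d, joint_dim) for d in text_dims])
        self.video_proj = nn.ModuleList([nn.Linear(d, joint_dim) for d in video_dims])

    def forward(self, text_feats, video_feats):
        # text_feats[i]: (num_queries, text_dims[i]);
        # video_feats[i]: (num_videos, video_dims[i]).
        sim = 0.0
        for t_proj, v_proj, t, v in zip(self.text_proj, self.video_proj,
                                        text_feats, video_feats):
            t_emb = F.normalize(t_proj(t), dim=-1)
            v_emb = F.normalize(v_proj(v), dim=-1)
            sim = sim + t_emb @ v_emb.T     # cosine similarity in this space
        return sim                          # (num_queries, num_videos)

def revise_similarities(sim, temperature=0.1):
    # Retrieval-stage softmax revision: normalizing over the query axis
    # penalizes videos that score high for every query before final ranking.
    return sim * F.softmax(sim / temperature, dim=0)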
Learning to detect video events from zero or very few video examples
In this work we deal with the problem of high-level event detection in video.
Specifically, we study the challenging problems of i) learning to detect video
events from solely a textual description of the event, without using any
positive video examples, and ii) additionally exploiting very few positive
training samples together with a small number of ``related'' videos. For
learning only from an event's textual description, we first identify a general
learning framework and then study the impact of different design choices for
various stages of this framework. For additionally learning from example
videos, when true positive training samples are scarce, we employ an extension
of the Support Vector Machine that allows us to exploit ``related'' event
videos by automatically introducing different weights for subsets of the videos
in the overall training set. Experimental evaluations performed on the
large-scale TRECVID MED 2014 video dataset provide insight into the effectiveness
of the proposed methods.
Comment: Image and Vision Computing Journal, Elsevier, 2015, accepted for publication.
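For the few-example setting, the weighted-SVM idea can be sketched with scikit-learn's per-sample weights: the few true positives, the ``related'' videos and the negatives form subsets that contribute to training with different weights. The feature dimensionality, the weight values and the use of LinearSVC are illustrative assumptions; the paper's SVM extension introduces such weights automatically rather than fixing them by hand.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, (5, 128))     # the very few true positive videos
X_rel = rng.normal(0.5, 1.0, (20, 128))    # "related" videos, used as weak positives
X_neg = rng.normal(-1.0, 1.0, (200, 128))  # background negatives

X = np.vstack([X_pos, X_rel, X_neg])
y = np.array([1] * (len(X_pos) + len(X_rel)) + [0] * len(X_neg))

# Different weights for different subsets of the overall training set:
# full trust in true positives and negatives, reduced trust in related videos.
w = np.concatenate([np.full(len(X_pos), 1.0),
                    np.full(len(X_rel), 0.3),
                    np.full(len(X_neg), 1.0)])

clf = LinearSVC(C=1.0).fit(X, y, sample_weight=w)
scores = clf.decision_function(X)          # rank candidate videos by this score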
Combination of Accumulated Motion and Color Segmentation for Human Activity Analysis
The automated analysis of activity in digital multimedia, and especially video, is gaining more and more importance due to the evolution of higher-level video processing systems and the development of relevant applications such as surveillance and sports. This paper presents a novel algorithm for the recognition and classification of human activities, which employs motion and color characteristics in a complementary manner, so as to extract the most information from both sources and overcome their individual limitations. The proposed method accumulates the flow estimates in a video and extracts "regions of activity" by processing their higher-order statistics. The shape of these activity areas can be used for the classification of the human activities and events taking place in a video and the subsequent extraction of higher-level semantics. Color segmentation of the active and static areas of each video frame is performed to complement this information. The color layers in the activity and background areas are compared using the earth mover's distance, in order to achieve accurate object segmentation. Thus, unlike much existing work on human activity analysis, the proposed approach is based on general color and motion processing methods, and not on specific models of the human body and its kinematics. The combined use of color and motion information increases the method's robustness to illumination variations and measurement noise. Consequently, the proposed approach can lead to higher-level information about human activities, but its applicability is not limited to specific human actions. We present experiments with various real video sequences, from the sports and surveillance domains, to demonstrate the effectiveness of our approach.
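A minimal sketch of one ingredient of this approach, assuming OpenCV and SciPy: the color content of an active region and a static region is compared with the earth mover's distance, here computed on 1-D hue distributions. The region extraction and the higher-order-statistics motion step are assumed to happen elsewhere, and the function names are illustrative.

import cv2
import numpy as np
from scipy.stats import wasserstein_distance

def hue_values(pixels_bgr):
    # pixels_bgr: (N, 3) uint8 array of BGR pixels from one region of a frame.
    hsv = cv2.cvtColor(pixels_bgr.reshape(-1, 1, 3), cv2.COLOR_BGR2HSV)
    return hsv[:, 0, 0].astype(np.float64)   # the hue value of each pixel

def color_layer_distance(active_pixels, static_pixels):
    # Earth mover's distance between the two hue distributions; a small value
    # suggests the active and static regions belong to the same object.
    return wasserstein_distance(hue_values(active_pixels),
                                hue_values(static_pixels))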
Filter-Pruning of Lightweight Face Detectors Using a Geometric Median Criterion
Face detectors are becoming a crucial component of many applications,
including surveillance, that often have to run on edge devices with limited
processing power and memory. There is therefore a pressing demand for compact
face detection models that can run efficiently on resource-constrained
devices. Over recent years, network pruning techniques have attracted
considerable attention from researchers; despite this growing popularity,
however, such methods have not been well examined in the context of face
detectors. In this paper,
we implement filter pruning on two already small and compact face detectors,
named EXTD (Extremely Tiny Face Detector) and EResFD (Efficient ResNet Face
Detector). The main pruning algorithm that we utilize is Filter Pruning via
Geometric Median (FPGM), combined with the Soft Filter Pruning (SFP) iterative
procedure. We also apply L1 Norm pruning, as a baseline to compare with the
proposed approach. The experimental evaluation on the WIDER FACE dataset
indicates that the proposed approach has the potential to further reduce the
model size of already lightweight face detectors, with limited accuracy loss,
or even with a small accuracy gain for low pruning rates.
Comment: Accepted for publication in the IEEE/CVF WACV 2024 Workshops proceedings, Hawaii, USA, Jan. 2024.
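A minimal PyTorch sketch of the FPGM criterion combined with soft pruning, as used above: within a convolutional layer, the filters with the smallest total distance to all other filters (i.e., those nearest the layer's geometric median) are considered redundant and are zeroed rather than removed, so they can recover during subsequent training epochs. The pruning ratio and the function names are illustrative assumptions.

import torch

def fpgm_keep_mask(weight, prune_ratio=0.3):
    # weight: (out_channels, in_channels, kH, kW) kernel tensor of a conv layer.
    n = weight.shape[0]
    flat = weight.reshape(n, -1)
    dist = torch.cdist(flat, flat)      # pairwise Euclidean distances between filters
    score = dist.sum(dim=1)             # total distance of each filter to all others
    keep = torch.ones(n, dtype=torch.bool)
    keep[score.argsort()[: int(n * prune_ratio)]] = False  # nearest the geometric median
    return keep

def soft_prune_(conv):
    # Soft Filter Pruning step: zero the selected filters in place, so they can
    # still receive gradient updates (and possibly recover) in later epochs.
    keep = fpgm_keep_mask(conv.weight.data)
    conv.weight.data[~keep] = 0.0

# Usage: apply soft_prune_ to each conv layer at the end of every training epoch.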
Learning Visual Explanations for DCNN-Based Image Classifiers Using an Attention Mechanism
In this paper two new learning-based eXplainable AI (XAI) methods for deep
convolutional neural network (DCNN) image classifiers, called L-CAM-Fm and
L-CAM-Img, are proposed. Both methods use an attention mechanism that is
inserted in the original (frozen) DCNN and is trained to derive class
activation maps (CAMs) from the last convolutional layer's feature maps. During
training, CAMs are applied to the feature maps (L-CAM-Fm) or the input image
(L-CAM-Img) forcing the attention mechanism to learn the image regions
explaining the DCNN's outcome. Experimental evaluation on ImageNet shows that
the proposed methods achieve competitive results while requiring a single
forward pass at the inference stage. Moreover, based on the derived
explanations a comprehensive qualitative analysis is performed providing
valuable insight for understanding the reasons behind classification errors,
including possible dataset biases affecting the trained classifier.
Comment: Accepted for publication; to be included in Proc. ECCV Workshops 2022. The version posted here is the "submitted manuscript" version.
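A minimal PyTorch sketch of the attention mechanism described above, under the assumption that it can be reduced to a 1x1 convolution producing one activation map per class: during training, the CAM of the target class re-weights the frozen classifier's feature maps (the L-CAM-Fm variant). Module sizes, the sigmoid activation and the names are illustrative, not the paper's exact design.

import torch
import torch.nn as nn

class CAMAttention(nn.Module):
    def __init__(self, in_channels, num_classes):
        super().__init__()
        # A 1x1 convolution yields one spatial activation map per class.
        self.attn = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, feats, target_class):
        # feats: (B, C, H, W) feature maps from the frozen DCNN's last conv
        # layer; target_class: (B,) ground-truth labels used during training.
        cams = torch.sigmoid(self.attn(feats))                 # (B, num_classes, H, W)
        cam = cams[torch.arange(feats.size(0)), target_class]  # (B, H, W)
        # L-CAM-Fm: the selected class's CAM re-weights the feature maps, which
        # are then fed to the frozen classifier head to compute the training loss.
        return feats * cam.unsqueeze(1), cam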
Masked Feature Modelling: Feature Masking for the Unsupervised Pre-training of a Graph Attention Network Block for Bottom-up Video Event Recognition
In this paper, we introduce Masked Feature Modelling (MFM), a novel approach
for the unsupervised pre-training of a Graph Attention Network (GAT) block. MFM
utilizes a pretrained Visual Tokenizer to reconstruct masked features of
objects within a video, leveraging the MiniKinetics dataset. We then
incorporate the pre-trained GAT block into a state-of-the-art bottom-up
supervised video-event recognition architecture, ViGAT, to improve the model's
starting point and overall accuracy. Experimental evaluations on the YLI-MED
dataset demonstrate the effectiveness of MFM in improving event recognition
performance.
Comment: 8 pages.
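A rough sketch of the pre-training objective, assuming PyTorch Geometric for the GAT block: randomly masked object features are replaced by a learnable token, the GAT processes the object graph, and a linear head must predict the discrete codes that a pretrained visual tokenizer assigns to the original objects. The vocabulary size, the graph construction and all names are assumptions for illustration only.

import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class MFMPretrainer(nn.Module):
    def __init__(self, feat_dim=768, vocab_size=8192):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(feat_dim))
        self.gat = GATConv(feat_dim, feat_dim, heads=1)
        self.head = nn.Linear(feat_dim, vocab_size)  # predicts tokenizer codes

    def forward(self, obj_feats, edge_index, mask, target_tokens):
        # obj_feats: (num_objects, feat_dim); mask: bool tensor marking the
        # masked objects; target_tokens: the visual tokenizer's codes.
        x = torch.where(mask.unsqueeze(-1), self.mask_token, obj_feats)
        x = self.gat(x, edge_index)
        logits = self.head(x[mask])
        # Cross-entropy against the tokenizer codes of the masked objects only.
        return nn.functional.cross_entropy(logits, target_tokens[mask])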
An Integrated System for Spatio-Temporal Summarization of 360-degrees Videos
In this work, we present an integrated system for spatiotemporal
summarization of 360-degrees videos. The video summary production mainly
involves detecting the salient events and synopsizing them into a concise
summary. The analysis relies on state-of-the-art methods for saliency detection
in 360-degrees video (ATSal and SST-Sal) and video summarization (CA-SUM). It
also contains a mechanism that classifies a 360-degrees video according to
whether a static or moving camera was used during recording and accordingly
decides which saliency detection method will be applied, as well as a 2D video
production component that is responsible for creating a conventional 2D video
containing the salient events of the 360-degrees video. Quantitative
evaluations using two datasets for
360-degrees video saliency detection (VR-EyeTracking, Sports-360) show the
accuracy and positive impact of the developed decision mechanism, and justify
our choice to use two different methods for detecting the salient events. A
qualitative analysis using content from these datasets gives further insight
into the functionality of the decision mechanism, shows the pros and cons of
each saliency detection method, and demonstrates the superior performance of
the trained summarization method against a more conventional approach.
Comment: Accepted for publication, 30th Int. Conf. on MultiMedia Modeling (MMM 2024), Amsterdam, NL, Jan.-Feb. 2024. This is the "submitted manuscript" version.
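The overall control flow of the system can be sketched as follows; every function here is a placeholder for the corresponding component (the camera classifier, ATSal/SST-Sal, the 2D production component and CA-SUM), and the mapping of camera type to saliency detector is left to the caller, since that mapping is the system's own decision mechanism.

def summarize_360(video, classify_camera, saliency_detectors, produce_2d, summarize):
    # 1. Classify the recording as shot with a static or a moving camera.
    camera_type = classify_camera(video)             # "static" or "moving"
    # 2. The decision mechanism picks the saliency detector for that camera
    #    type (ATSal or SST-Sal); the mapping is supplied via the dictionary.
    saliency_maps = saliency_detectors[camera_type](video)
    # 3. Produce a conventional 2D video that follows the salient events.
    video_2d = produce_2d(video, saliency_maps)
    # 4. Temporally summarize the 2D video into a concise synopsis (CA-SUM).
    return summarize(video_2d)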
Facilitating the Production of Well-tailored Video Summaries for Sharing on Social Media
This paper presents a web-based tool that facilitates the production of
tailored summaries for online sharing on social media. Through an interactive
user interface, it supports a ``one-click'' video summarization process. Based
on the integrated AI models for video summarization and aspect ratio
transformation, it facilitates the generation of multiple summaries of a
full-length video according to the needs of target platforms with regard to the
video's length and aspect ratio.
Comment: Accepted for publication, 30th Int. Conf. on MultiMedia Modeling (MMM 2024), Amsterdam, NL, Jan.-Feb. 2024. This is the "submitted manuscript" version.
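The tool's flow can be sketched as one loop over target platforms, each with its own length and aspect-ratio constraints; the platform table and the two model calls below are illustrative assumptions about the internals, not the tool's actual API.

# Illustrative platform constraints; the real targets and limits may differ.
TARGET_PLATFORMS = {
    "instagram_reels": {"max_seconds": 90,  "aspect_ratio": "9:16"},
    "youtube":         {"max_seconds": 600, "aspect_ratio": "16:9"},
}

def produce_summaries(video, summarize, transform_aspect_ratio):
    summaries = {}
    for platform, spec in TARGET_PLATFORMS.items():
        # AI model 1: select fragments for a summary of the requested length.
        summary = summarize(video, max_seconds=spec["max_seconds"])
        # AI model 2: retarget the summary to the platform's aspect ratio.
        summaries[platform] = transform_aspect_ratio(summary, spec["aspect_ratio"])
    return summaries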
Combining textual and visual information processing for interactive video retrieval: SCHEMA's participation in TRECVID 2004
In this paper, the two different applications based on the Schema Reference System that were developed by the SCHEMA NoE for participation in the search task of TRECVID 2004 are presented. The first application, named "Schema-Text", is an interactive retrieval application that employs only textual information, while the second one, named "Schema-XM", is an extension of the former, employing algorithms and methods for combining textual, visual and higher-level information. Two runs were submitted for each application: I A 2 SCHEMA-Text 3 and I A 2 SCHEMA-Text 4 for Schema-Text, and I A 2 SCHEMA-XM 1 and I A 2 SCHEMA-XM 2 for Schema-XM. The comparison of the two applications in terms of retrieval efficiency revealed that combining information from different data sources can provide higher efficiency for retrieval systems. Experimental testing additionally revealed that initially performing a text-based query and subsequently proceeding with a visual similarity search, using one of the returned relevant keyframes as an example image, is a good scheme for combining visual and textual information.
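The interaction scheme that the experiments favored can be sketched as three steps; the function names below are placeholders for the corresponding Schema Reference System modules and are purely illustrative.

def interactive_search(text_query, text_search, visual_search, pick_relevant_keyframe):
    # Step 1: text-based query over the video collection (Schema-Text style).
    results = text_search(text_query)
    # Step 2: the user marks one returned keyframe as relevant...
    example = pick_relevant_keyframe(results)
    # Step 3: ...and it seeds a visual similarity search for the final ranking.
    return visual_search(example)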